Learning Rate

In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain.

In setting a learning rate, there is a trade-off between the rate of convergence and overshooting. While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction. A learning rate that is too high will make the learning jump over minima, while one that is too low will either take too long to converge or get stuck in an undesirable local minimum.

In order to achieve faster convergence, prevent oscillations, and avoid getting stuck in undesirable local minima, the learning rate is often varied during training, either according to a learning rate schedule or by using an adaptive learning rate. The learning rate and its adjustments may also differ per parameter, in which case it is a diagonal matrix that can be interpreted as an approximation to the inverse of the Hessian matrix in Newton's method. The learning rate is related to the step length determined by inexact line search in quasi-Newton methods and related optimization algorithms.

When conducting line searches, mini-batch sub-sampling (MBSS) affects the characteristics of the loss function along which the learning rate needs to be resolved. Static MBSS keeps the mini-batch fixed along a search direction, resulting in a smooth loss function along the search direction. Dynamic MBSS updates the mini-batch at every function evaluation, resulting in a point-wise discontinuous loss function along the search direction. Line searches that adaptively resolve learning rates for static MBSS loss functions include the parabolic approximation line (PAL) search. Line searches that adaptively resolve learning rates for dynamic MBSS loss functions include probabilistic line searches, gradient-only line searches (GOLS) and quadratic approximations.
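
To make the role of the step size concrete, the following minimal Python sketch runs plain gradient descent on a toy one-dimensional quadratic loss; the loss function, starting point and rate values are illustrative choices rather than part of any particular library or method described above.

```python
def loss(w):
    return (w - 3.0) ** 2        # toy quadratic loss with its minimum at w = 3


def grad(w):
    return 2.0 * (w - 3.0)       # derivative of the loss


def gradient_descent(eta, steps=50, w=0.0):
    """Plain gradient descent: each step is the learning rate times the gradient."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


print(gradient_descent(eta=0.1))   # small enough: converges close to w = 3
print(gradient_descent(eta=1.1))   # too large: the iterates overshoot and diverge
```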


Learning rate schedule

Initial rate can be left as a system default or can be selected using a range of techniques. A learning rate schedule changes the learning rate during learning and is most often changed between epochs/iterations. This is mainly done with two parameters: decay and momentum. There are many different learning rate schedules, but the most common are time-based, step-based and exponential.

Decay serves to settle the learning in a nice place and avoid oscillations, a situation that may arise when a too high constant learning rate makes the learning jump back and forth over a minimum, and is controlled by a hyperparameter.

Momentum is analogous to a ball rolling down a hill; we want the ball to settle at the lowest point of the hill (corresponding to the lowest error). Momentum both speeds up the learning (increasing the learning rate) when the error cost gradient is heading in the same direction for a long time, and avoids local minima by 'rolling over' small bumps. Momentum is controlled by a hyperparameter analogous to a ball's mass, which must be chosen manually: too high and the ball will roll over minima which we wish to find, too low and it will not fulfil its purpose. The formula for factoring in the momentum is more complex than for decay but is most often built in with deep learning libraries such as Keras.

Time-based learning schedules alter the learning rate depending on the learning rate of the previous time iteration. Factoring in the decay, the mathematical formula for the learning rate is: \eta_{n+1} = \frac{\eta_n}{1 + dn} where \eta_n is the learning rate, d is a decay parameter and n is the iteration step.

Step-based learning schedules change the learning rate according to some predefined steps. The decay application formula is here defined as: \eta_n = \eta_0 d^{\left\lfloor \frac{1+n}{r} \right\rfloor} where \eta_n is the learning rate at iteration n, \eta_0 is the initial learning rate, d is how much the learning rate should change at each drop (0.5 corresponds to a halving) and r corresponds to the drop rate, or how often the rate should be dropped (10 corresponds to a drop every 10 iterations). The ''floor'' function (\lfloor\dots\rfloor) here drops the value of its input to 0 for all values smaller than 1.

Exponential learning schedules are similar to step-based, but instead of steps a decreasing exponential function is used. The mathematical formula for factoring in the decay is: \eta_n = \eta_0 e^{-dn} where d is a decay parameter.
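
As a sketch of how these schedules translate into code, the Python functions below mirror the three formulas above; the parameter names (eta0 for the initial rate, d for decay, r for the drop rate, n for the iteration) follow the notation in the text, and the functions themselves are illustrative rather than taken from any library.

```python
import math


def time_based(eta_prev, d, n):
    """Time-based decay: eta_{n+1} = eta_n / (1 + d * n)."""
    return eta_prev / (1.0 + d * n)


def step_based(eta0, d, r, n):
    """Step-based decay: eta_n = eta0 * d ** floor((1 + n) / r).

    With d = 0.5 and r = 10 the learning rate is halved every 10 iterations.
    """
    return eta0 * d ** math.floor((1 + n) / r)


def exponential(eta0, d, n):
    """Exponential decay: eta_n = eta0 * exp(-d * n)."""
    return eta0 * math.exp(-d * n)
```

In practice such schedules are usually supplied to the optimizer by the training framework rather than applied by hand, as noted above for momentum.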


Adaptive learning rate

The issue with learning rate schedules is that they all depend on hyperparameters that must be manually chosen for each given learning session and may vary greatly depending on the problem at hand or the model used. To combat this there are many different types of adaptive gradient descent algorithms such as Adagrad, Adadelta, RMSprop, and Adam, which are generally built into deep learning libraries such as Keras.
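
As one concrete example of such a per-parameter rule, the sketch below implements the standard Adagrad update, which divides a global rate by the root of the accumulated squared gradients so that each parameter effectively receives its own learning rate; it assumes NumPy and a user-supplied gradient function, and the default values are illustrative only.

```python
import numpy as np


def adagrad(grad_fn, w, eta=0.01, eps=1e-8, steps=100):
    """Adagrad: per-parameter step sizes eta / (sqrt(G) + eps),
    where G accumulates the squared gradients seen so far."""
    G = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)                          # gradient of the loss at w
        G += g ** 2                             # running sum of squared gradients
        w = w - eta * g / (np.sqrt(G) + eps)    # larger past gradients -> smaller steps
    return w
```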


See also

* Hyperparameter (machine learning)
* Hyperparameter optimization
* Stochastic gradient descent
* Variable metric methods
* Overfitting
* Backpropagation
* AutoML
* Model selection
* Self-tuning




External links

* Nando de Freitas, "Optimization", Deep Learning Lecture 6, University of Oxford, February 12, 2015, via YouTube: https://www.youtube.com/watch?v=0qUAb94CpOw&list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu&index=9